Reliable experimental demonstrations of violations of local realism are highly desirable for fundamental tests of quantum mechanics. One can quantify the violation witnessed by an experiment
in terms of a statistical p-value, which can be defined as the maximum probability according to 
local realism of a violation at least as high as that witnessed. Thus, high violation corresponds 
to small p-value. We propose a prediction-based-ratio (PBR) analysis protocol whose p-values are 
valid even if the prepared quantum state varies arbitrarily and local realistic models can depend on 
previous measurement settings and outcomes. It is therefore not subject to the memory loophole 
[J. Barrett et al., Phys. Rev. A 66, 042111 (2002)]. If the prepared state does not vary in time, 
the p-values are asymptotically optimal. For comparison, we consider protocols derived from the 
number of standard deviations of violation of a Bell inequality and from martingale theory [R. Gill, 
arXiv:quant-ph/0110137]. We find that the p-values of the former can be too small and are therefore 
not statistically valid, while those derived from the latter are sub-optimal. PBR p-values do not 
require a predetermined Bell inequality and can be used to compare results from different tests of 
local realism independent of experimental details. 

PACS numbers: 03.65.Ud, 03.65.Ta, 02.50.Tt 



I. INTRODUCTION 

Quantum mechanics violates local realism (LR) [1]. To show such violation, experimenters usually test a Bell inequality that is satisfied by all local realistic models (LR models) such as the Clauser-Horne-Shimony-Holt (CHSH) inequality [2]

$I_{\mathrm{CHSH}} = E(A_1 B_1) + E(A_1 B_2) + E(A_2 B_1) - E(A_2 B_2) \le 2$, (1)

where $E(A_i B_j)$ with $i, j \in \{1, 2\}$ is the correlation between measurements $A_i$ and $B_j$ with outcomes ±1. To
test this inequality, each of two parties — Alice and Bob — 
receives one particle from a common source. Each per- 
forms one of two possible measurements chosen randomly 
and independently on their own particle and records the 
outcome. We call this procedure a trial. After a large 
number of trials, Alice and Bob estimate the CHSH expression $I_{\mathrm{CHSH}}$, which is the left-hand side of the CHSH inequality, from their joint measurement outcomes. Following this approach, the departure from LR is typically given in terms of the number of experimental standard deviations (SDs) separating the estimate of $I_{\mathrm{CHSH}}$ from its LR upper bound of 2. For example, Weihs et al. [3] report an experimental estimate $I_{\mathrm{CHSH}} = 2.73 \pm 0.02$ and claim a violation of the CHSH inequality by 30 SDs.

There are several problems with this analysis protocol. 
First, although the SD partially characterizes the mea- 
surement uncertainty due to a finite number of trials, 
it does not consider the probability that a local realis- 
tic system could also violate the inequality after a finite 
number of trials. Because such a system's (non-)violation 
can have a larger SD, the experimental SD may suggest 
a stronger violation of LR than justified. Second, one 



would expect that the probability distribution of the estimate of $I_{\mathrm{CHSH}}$ under LR is Gaussian, since this appears
to be justified by the central limit theorem [4] as the
number of trials approaches infinity. It therefore seems 
reasonable to statistically quantify the violation by the 
probability that a Gaussian random variable can exceed 
the mean by the number of SDs of violation experimen- 
tally observed. However, for a finite number of trials and 
high violation, the Gaussianity assumption fails. Third, 
it is desirable to compare experimental results from dif- 
ferent tests of LR, but the effects of the problem with ex- 
perimental SDs and of the failure of Gaussianity depend 
on the Bell inequality, the quantum state, measurement 
settings, detection efficiency, and other experimental pa- 
rameters. Consequently, the number of SDs of violation 
cannot be used to directly compare the amount of evi- 
dence for rejecting local realism obtained from different 
experimental tests. 

In this paper, we show how to analyze data from exper- 
imental tests of LR to compute a measure of the strength 
of the evidence against LR. By computing this measure, 
LR violation by different experiments can be rigorously 
assessed and compared. Specifically, the proposed analy- 
sis protocol quantifies LR violation in terms of p-values, 
where small p-values imply strong violation. We call 
this the prediction-based-ratio (PBR) protocol. Protocols such as this compute a p-value from a "test statistic", that is, a value $T(x)$ computed from the data $x$. There are many such statistics to choose from; an example is the average Bell-inequality violation, which is used by the SD-based protocol. The p-value returned by the protocol is computed from a putative upper bound $b(t)$ on the tail probabilities $\mathrm{Prob}(T(x) \ge t)$ for $x$ distributed according to LR models. The p-value of the protocol given the observed data $x$ is defined by $p^{(\mathrm{prot})} = b(T(x))$. In order to be able to interpret the protocol's p-value as a measure of LR violation, it must satisfy statistical validity: The protocol and its p-values are valid if the bound $b(t) \ge \mathrm{Prob}(T(x) \ge t)$ is true whenever $x$ is distributed according to an LR model. See App. 1 for a discussion of the relevant statistical concepts and justification for the use of p-values.

We prove that the PBR protocol is valid and compare 
it to SD- and martingale-based [6] protocols. For n 
independent and identically distributed trials, these protocols have the property that the p-value $p$ is exponentially close to 0. That is, $p \sim 2^{-Gn}$ for large $n$. We call
G the asymptotic confidence-gain rate. It is desirable 
to have a high confidence-gain rate as this implies that 
fewer trials are needed to achieve the same strength of 
violation of LR. The optimal confidence-gain rate that 
can be achieved by any protocol is given by the statis- 
tical strength S in units of bits per trial as defined in 
Ref. [7]. We prove that the PBR protocol is asymptotically optimal. That is, its p-values achieve the optimal confidence-gain rate. The confidence-gain rates for different protocols are shown in Figs. 1 and 2 for a number
of experimental configurations that are explained in the 
next section. The figures show that SD-based p-values 
are not valid in some regions. Because the relation- 
ship of the SD-based confidence-gain rates compared to 
the asymptotically optimal ones varies substantially, re- 
sults of experiments with different configurations cannot 
be directly compared by the common "number of SDs 
of violation" measure. The martingale-based protocol 
is valid and computationally simple but has suboptimal 
confidence-gain rates. 
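
The practical meaning of a confidence-gain rate can be illustrated with a short calculation. The following Python sketch takes the asymptotic relation $p \sim 2^{-Gn}$ at face value and ignores sub-exponential corrections; the function name and the two gain rates are hypothetical values chosen only for illustration, not results from this paper.

```python
import math

def trials_needed(target_bits, gain_rate):
    """Smallest n with gain_rate * n >= target_bits.

    target_bits is the desired -log2(p); gain_rate is G in bits per trial.
    Uses only the asymptotic relation p ~ 2**(-G*n).
    """
    if gain_rate <= 0:
        raise ValueError("gain rate must be positive")
    return math.ceil(target_bits / gain_rate)

# Hypothetical gain rates (bits per trial), for illustration only.
n_slow = trials_needed(100, 0.03)
n_fast = trials_needed(100, 0.045)
```

Under these assumed rates, the protocol with the higher gain rate reaches the same nominal log-p-value with roughly two thirds of the trials.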

The PBR protocol remains valid if the prepared quan- 
tum state varies arbitrarily and the LR models to be 
rejected depend on previous measurement settings and 
outcomes, that is, in the presence of the memory effect [8]. This is desirable not only for tests of LR but also for practical applications of quantum information, such as device-independent quantum key distribution [9, 10], randomness expansion [11], state estimation [12], and certification of entangled measurements [13].

Compared with the other two protocols, an advantage 
of the PBR protocol is that it can be applied to a wide 
variety of configurations (the combinations of quantum 
state, measurement settings and other relevant parame- 
ters) without having to specify a Bell inequality. Since 
such Bell inequalities characterize the family of setting 
and outcome distributions achievable by LR models, they 
provide a useful guide to designing an experiment and de- 
termining good goal configurations to be achieved. But 
since Bell-inequality violation is not directly related to 
statistical strength, it is not obvious how to choose the 
best inequality before the experiment. Moreover, the pre- 
determined Bell inequality restricts a successful experi- 
ment to configurations close to the goal, closer than may 
be achievable in a given experiment. The PBR protocol 
automatically adapts to deviations from the goal, achiev- 



ing optimal confidence-gain rates for the actual configu- 
ration. One can exploit this adaptability by applying the 
PBR protocol to experiments in progress. This makes it 
possible to monitor the current (non-)violation of LR for
the purpose of optimizing experimental parameters and 
settings. The online ancillary files contain the code and 
documentation for an implementation of the PBR proto- 
col (the local realism analysis engine) that can be used 
for monitoring experiments in progress and for analyzing 
existing data sets. Our results show that the PBR pro- 
tocol is sufficiently efficient for practical use with typical 
experimental configurations. 

The paper is structured as follows: In Sec. II we summarize the mentioned methods for calculating p-values and show how their confidence-gain rates compare for tests of LR based on Bell inequalities. The methods are applied to and compared on simulated and actual experiments. The theory for the methods is in Sec. III. We assume that the readers are familiar with the basics of LR and tests of LR based on Bell inequalities. For reviews of the field, see Refs. [14-17].



II. COMPARISON OF PROTOCOLS 

We consider three protocols that determine p-values 
for LR rejection from experimental data: SD-based, 
martingale-based, and PBR protocols. The first two de- 
pend on a Bell inequality, whereas the PBR protocol re- 
quires only the sequence of measurement settings and 
outcomes. 

For the purposes of discussion, we fix a Bell inequality

$\langle I(x) \rangle \le B$, (2)

where $I(x)$ is a real-valued function of the measurement setting and outcome combination $x$ of a single trial, and $I = \langle I(x) \rangle$ is its expectation. Here, the measurement setting distribution is built into the inequality. An example is the CHSH inequality in Eq. (1). In this case, if $x$'s settings are $i, j$ and its outcomes are $a, b$, then

$I(x) = (1 - 2\delta_{i,2}\delta_{j,2})\, a b / p_{i,j}$, and $B = 2$, (3)



where $p_{i,j}$ is the probability of choosing the setting combination $i, j$ in each trial. The functional form $I(x)$ in Eq. (3) ensures that its expectation is equal to the left-hand side of the CHSH inequality (1). In particular, this
requires dividing by the known probabilities of the mea- 
surement settings. There is no loss of generality by fix- 
ing the setting probabilities in advance. Violation of LR 
requires that measurement settings be chosen indepen- 
dently of any hidden variables. In particular, the locality 
and memory loopholes cannot be closed unless in each 
trial, measurement settings are chosen randomly and in- 
dependently by each party with no possibility of a causal 
connection and according to a known probability distribution. We allow for arbitrary setting distributions in this formulation. For the results in Figs. 1, 2, 3, and 4, $p_{i,j} = 1/4$.

FIG. 1: Confidence-gain rates G for the SD-based protocol. G is shown for a CHSH test of LR with an unbalanced Bell state with no loss and perfect detectors. It depends on the parameter $\theta$ in the unbalanced Bell state $\cos(\theta)|00\rangle + \sin(\theta)|11\rangle$. The measurement settings are chosen to maximize the violation of the CHSH inequality (1). G is compared with the optimal gain rate given by the statistical strength (Sec. III C) for this test. The cross-over occurs at $\theta = 33.41°$. SD-based confidence-gain rates were computed with respect to the conventional method for estimating violation; see Sec. III A.


Given an experimentally obtained sequence of settings and outcomes $x_1, \ldots, x_n$ from $n$ trials, we get an estimate $\hat{I} = \frac{1}{n}\sum_{k=1}^{n} I(x_k)$ of $I$. Note that this approach differs from the one where each expectation in Eq. (1) is separately estimated by conditioning on the respective measurement settings, as is commonly done in experiments to produce an estimate $\tilde{I}$ of $I$. The difference is discussed in Sec. III A and does not significantly affect the comparisons made here. In this section we outline and compare the protocols. Technical details are in Sec. III.
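
As a minimal sketch of the sample-mean estimate described above, the following Python fragment evaluates the per-trial CHSH function of Eq. (3) and averages it over recorded trials. The tuple layout (i, j, a, b), the function names, and the toy data are illustrative assumptions and are not taken from the paper's ancillary code; uniform settings with each setting pair chosen with probability 1/4 are assumed.

```python
from statistics import mean

def chsh_value(x, p_setting=0.25):
    """Per-trial CHSH function I(x) for x = (i, j, a, b).

    i, j in {1, 2} are the settings; a, b in {+1, -1} are the outcomes.
    p_setting is the (assumed uniform) probability of each setting pair.
    """
    i, j, a, b = x
    sign = -1 if (i == 2 and j == 2) else 1  # 1 - 2*delta_{i,2}*delta_{j,2}
    return sign * a * b / p_setting

def estimate_I(trials, p_setting=0.25):
    """Sample mean of I(x_k) over all recorded trials."""
    return mean(chsh_value(x, p_setting) for x in trials)

# Toy data (hypothetical): correlated outcomes for three setting pairs and
# anticorrelated outcomes for settings (2, 2).
toy_trials = [(1, 1, 1, 1), (1, 2, -1, -1), (2, 1, 1, 1), (2, 2, 1, -1)]
I_hat = estimate_I(toy_trials)
```

With uniform settings each trial contributes +4 or -4, so the toy data above reach the algebraic maximum of 4; real data would of course average to a smaller value.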



A. SD-based Protocol 

The results from the trials are used to obtain $\hat{I}$ and estimate the SD $\sigma$ of $\hat{I}$. Given that $\hat{I} > B$, it is conventional to give $(\hat{I} - B)/\sigma$, the number of SDs of violation, as a measure of the amount of violation. If we pretend that the probability distribution of the estimate of $I$ given LR is Gaussian with mean bounded by $B$ and variance $\sigma^2$, we can compute a p-value

$p^{(\mathrm{SD})} = Q\big((\hat{I} - B)/\sigma\big)$, (4)

where $Q(z)$ is the Q-function, which is the probability that a standard normal random variable $N$ satisfies $N > z$. As a function of the number of trials $n$, $\sigma\sqrt{n}$ approaches $\sigma_1$, where $\sigma_1$ is an effective one-trial SD. For large $n$, the quantity $Q((\hat{I} - B)/\sigma)$ approaches $e^{-n(I - B)^2/(2\sigma_1^2)}$ to leading exponential order. Thus the asymptotic confidence-gain rate for the SD-based protocol is

$G_{\mathrm{SD}} = \log_2(e)\,\frac{(I - B)^2}{2\sigma_1^2}$. (5)

FIG. 2: The confidence-gain rate G of a CHSH test of LR with Bell states and varying detection efficiency $\eta$ and visibility $V$. The measurement settings are chosen to maximize the violation of the CHSH inequality (1). Measurement outcomes where no photon is detected are assigned the value $-1$.



SD-based p-values are not valid because the experimental SD is different from the worst-case SD assuming LR, and because deviations from Gaussianity in the extreme tail of the distribution for $\hat{I}$ cannot be asymptotically neglected. To explain this issue, define the random variable $F = \sqrt{n}(\hat{I} - B)/\sigma_1$. For any LR model, $\langle F \rangle \le 0$. We expect that according to the central limit theorem, $F - \langle F \rangle$ converges in distribution to a standard normal distribution. Assuming LR models have the same or a smaller SD, we are interested in the probability of the event that $F \ge \sqrt{n}\,V_n/\sigma_1$, where $V_n$ is the violation of the Bell inequality found after $n$ trials. But convergence in distribution cannot be used to compute probabilities of events that depend on $n$.
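
For concreteness, the nominal SD-based quantities can be evaluated as in the following Python sketch. As explained above, the resulting p-value is not statistically valid; the code only shows how the conventional number is obtained. The function names and the numerical inputs are illustrative assumptions. The one-trial SD used in the last line assumes the sample-mean estimate with uniform settings, for which each trial contributes ±4 and the one-trial variance is $16 - I^2$.

```python
import math

def q_function(z):
    """Upper tail probability of a standard normal variable: Prob(N >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def sd_based_p_value(I_hat, sigma, B=2.0):
    """Nominal (not statistically valid) p-value Q((I_hat - B) / sigma)."""
    return q_function((I_hat - B) / sigma)

def sd_gain_rate(I, sigma_1, B=2.0):
    """Asymptotic rate log2(e) * (I - B)^2 / (2 * sigma_1^2), in bits per trial."""
    return math.log2(math.e) * (I - B) ** 2 / (2.0 * sigma_1 ** 2)

# Hypothetical estimate lying 5 SDs above the LR bound.
p_nominal = sd_based_p_value(2.5, 0.1)

# Ideal Bell state at the CHSH-optimal settings: I = 2*sqrt(2), and under the
# stated assumption the one-trial variance is 16 - 8 = 8.
g_sd_ideal = sd_gain_rate(2.0 * math.sqrt(2.0), math.sqrt(8.0))
```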

A comparison of the confidence-gain rate for the SD- 
based protocol to the asymptotically optimal one is 
shown in Fig. 1. It implies that SD-based p-values can
be lower than justified and are therefore not valid. The 
worst case is when the state used is a Bell state, i.e., a 
maximally entangled state of two qubits, which is an aim 
of most experiments to date. The family of unbalanced 
Bell states considered in Fig. 1 is of interest because they
are more tolerant of low detection efficiency [18] . Exper- 
imental techniques to prepare arbitrary unbalanced Bell 
states without postselection have been demonstrated and 
applied to tests of LR PHHS]. 

The number of SDs of violation is not normally explicitly converted to a p-value as done here. Instead, it is
primarily intended as a way of claiming successful vio- 
lation with a good signal-to-noise ratio. Naturally, one 
would like to use the measure to compare the strength 
of the violation for different experiments. Such a relative 
comparison works only if the experiments use the same 
test of LR with the same state, experimental settings, 
losses, visibilities, and other relevant parameters. From 
Fig. 1 we can infer that, if we use the number of SDs to
compare the violation of the CHSH inequality in experi- 
ments involving different unbalanced Bell states, we tend 
to unfairly favor the experiment with the more balanced 
state. 



B. Martingale-based Protocol

Another problem with the SD-based protocol is that it assumes that the trials are independent and identically distributed; that is, it does not consider the memory effect [8]. We cannot expect the prepared states and experimental settings to be stable over the course of a long sequence of trials. In addition, it is desirable to take into account the possibility that the experimental system is subject to a model of LR where the entire history of the experiment can affect the events to come, except that the measurement-setting choices are still under independent experimental control. To account for these effects, R. Gill suggested a method for calculating p-values based on the martingale structure of the time sequence of observations in a test of LR [5, 6].

The martingale-based p-value is computed according to

$p^{(\mathrm{mart})} = \exp\left(-\frac{n(\hat{I} - B)^2}{32}\right)$. (6)

Here, we assume without loss of generality that $I(x)$ and $B$ have been shifted and normalized so that for every argument $x$, the value $I(x)$ is bounded between $-4$ and $4$. If the function $I(x)$ in a Bell inequality $\langle I(x) \rangle \le B$ does not satisfy this condition, then determine $b_l = \min_x I(x)$, $b_u = \max_x I(x)$ and replace $I(x)$ and $B$ by $I'(x) = 8(I(x) - b_l)/(b_u - b_l) - 4$ and $B' = 8(B - b_l)/(b_u - b_l) - 4$. The martingale-based protocol is valid, but is based on conservative tail estimates and therefore is not asymptotically optimal. For large $n$, $\hat{I}$ approaches $I$, thus the asymptotic confidence-gain rate is

$G_{\mathrm{mart}} = \log_2(e)\,\frac{(I - B)^2}{32}$. (7)

A comparison of SD-based, martingale-based, and asymptotically optimal confidence-gain rates is shown in Fig. 2 for a CHSH test with noisy and lossy Bell states.



C. PBR Protocol

In contrast to a fixed Bell inequality used in the SD-based or martingale-based protocol, after $k$ trials but before the $(k+1)$'th trial the PBR protocol returns a special Bell inequality of the form

$\langle R_k(x) \rangle \le 1$ (8)

with $R_k(x)$ nonnegative. The PBR p-values are determined by the values of $R_k$ at the setting and outcome combination $x_{k+1}$ of the $(k+1)$'th trial. In particular, as shown in Sec. III C, any such sequence of inequalities yields a valid p-value computed according to

$p^{(\mathrm{PBR})} = \min\left(\left(\prod_{k=1}^{n} R_{k-1}(x_k)\right)^{-1}, 1\right)$. (9)

The PBR protocol aims to optimize the expected p-value by computing the PBRs $R_k(x) = q_x^{(k)}/p_{\mathrm{LR},x}^{(k)}$, where $q_x^{(k)}$ is an estimate of the distribution of future setting and outcome combinations $x$, which can be based on $x_1, \ldots, x_k$ and can take into account other experimental information obtained before starting the $(k+1)$'th trial. The quantity in the denominator, $p_{\mathrm{LR},x}^{(k)}$, is the probability of $x$ given by an optimal LR model with respect to the estimates $q_x^{(k)}$. The notion of optimality is defined in Sec. III C and guarantees the desired inequality (8). We define the (negative) log-p-value increment for the $k$'th trial as $\log_2(R_{k-1}(x_k))$. For independent and identically distributed trials, $q_x^{(k)}$ converges to the true probabilities $q_x$, and the asymptotic confidence-gain rate is

$G_{\mathrm{PBR}} = S_q$, (10)

where $S_q$ is the statistical strength defined in Sec. III C. This is the optimal valid confidence-gain rate for a given test configuration and is plotted in Figs. 1 and 2.
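
The bookkeeping behind the PBR p-value of Eq. (9) is simple once the ratios are available. The following Python sketch accumulates the log-p-value increments and returns the running negative log-p-value after each trial. It assumes that the observed ratio values $R_{k-1}(x_k)$ have already been computed by some other means; the function name and the input values in the example are hypothetical.

```python
import math

def pbr_log2_p_values(ratios):
    """Running -log2(p) for p = min(1 / prod_k R_{k-1}(x_k), 1).

    `ratios` holds the observed values R_{k-1}(x_k), one per trial, each
    assumed nonnegative. Logarithms are accumulated to avoid overflow or
    underflow of the product.
    """
    running = []
    log2_product = 0.0
    for r in ratios:
        # A zero ratio drives the product to zero, so p stays at 1 afterwards.
        log2_product += math.log2(r) if r > 0 else -math.inf
        running.append(max(log2_product, 0.0))
    return running

# Hypothetical ratio values, for illustration only.
example = pbr_log2_p_values([2.0, 2.0, 0.5])
```

Working in the log domain matters in practice because the p-values of interest are far below the smallest representable floating-point number.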



D. Application to Experiments 

The above protocols can compute p-values for recorded trials as an experiment progresses, and such "running" p-values may be used to optimize experimental settings. Because we are interested in extremely small p-values with exponential asymptotic behavior, we generally consider and display the (negative) log-p-value.

SD-based and martingale-based protocols are restricted to a fixed Bell inequality. The PBR protocol does not have this restriction, which enables wider searches for strong LR violation. Running log-p-values are shown for a simulation in Fig. 3 and for data from Ref. [11] in Fig. 4. The PBR p-values were computed with our implementation of the local realism analysis engine; see the documentation and code. Relevant aspects of the implementation such as data blocking and learning transients are discussed in App. 2. Note that whereas running log-p-values can be useful for monitoring and tweaking an experiment, they must not be used as a stopping criterion once an experiment has been configured.

FIG. 3: Running log-p-values as a function of the number of trials n in a CHSH test of LR with a Bell state without noise or inefficiency. The log-p-values are computed according to the three protocols discussed. The slopes of the straight lines are the asymptotic confidence-gain rates for each protocol. (a) is for one simulation of 5000 successive trials; (b) is an average of 30 simulations. The square roots of the unbiased estimates of the one-run variances are shown as gray regions around the averages and indicate the expected fluctuation for one sequence of n trials for each n plotted. Note that for one sequence, the fluctuations are not independent as the sequence progresses.

FIG. 4: Running log-p-values as a function of the number of trials n in the experiment of Ref. [11]. The dotted lines are provided only to guide the eye.

For Fig. 3 we simulated a CHSH test of LR with a Bell state and measurement settings maximizing violation of the CHSH inequality (1). We assumed an ideal experiment (no loss of photons or visibility) and simulated 5000 successive trials. The log-p-values were updated for successive blocks of 56 trials (see App. 2). In particular, the function $R_k(x)$ used by the PBR protocol was recomputed based on the trials seen so far every 56 trials. The figure shows typical and average runs and compares the running log-p-values to the asymptotic lines with slopes given by the respective gain rates. The slopes of the running log-p-values approach the gain rates, but PBR log-p-values have a systematic offset that can be attributed to an initial transient where the setting and outcome distribution is being learned. The transient can be removed if, before the experiment is started, we have a good estimate of the distribution. Such an estimate could be based on theory (quantum or otherwise) or previous measurements, and can be used to "prime" the ratios $R_k(x)$.
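
A simulation of this kind can be sketched in a few lines of Python. The fragment below draws trials for an ideal Bell state with uniform settings and evaluates the martingale-based running log-p-value after each complete block of 56 trials. It is an illustrative reconstruction under stated assumptions, not the code used for Fig. 3: the sampling scheme, the function names, and the random seed are choices made here, and only the martingale-based curve is computed.

```python
import math
import random

E = 1.0 / math.sqrt(2.0)  # magnitude of each correlation at the optimal settings

def simulate_trial(rng):
    """Draw one trial x = (i, j, a, b) for an ideal Bell state, uniform settings."""
    i, j = rng.choice((1, 2)), rng.choice((1, 2))
    corr = -E if (i == 2 and j == 2) else E
    a = rng.choice((1, -1))                         # unbiased local outcome
    b = a if rng.random() < (1 + corr) / 2 else -a  # Prob(a*b = +1) = (1 + corr)/2
    return i, j, a, b

def running_log2_p(trials, block=56, B=2.0):
    """Martingale-based -log2(p) after each complete block of trials."""
    out, total = [], 0.0
    for n, (i, j, a, b) in enumerate(trials, start=1):
        total += (-4.0 if (i == 2 and j == 2) else 4.0) * a * b  # I(x), p_ij = 1/4
        if n % block == 0:
            excess = max(total / n - B, 0.0)  # no violation gives p = 1
            out.append((n, math.log2(math.e) * n * excess ** 2 / 32.0))
    return out

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
trials = [simulate_trial(rng) for _ in range(5000)]
curve = running_log2_p(trials)
```

Each entry of `curve` pairs a trial count with the corresponding negative log-p-value; with 5000 trials and blocks of 56 there are 89 complete blocks, the last ending at trial 4984.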



For Fig. 4 we compute log-p-values for the data from the experiment described in Ref. [11]. In this experiment, two $^{171}\mathrm{Yb}^{+}$ ions separated by about one meter were entangled through a probabilistic process. In this process, each ion is entangled with one emitted photon. By projecting the two emitted photons into a Bell state, the two remote ions are entangled with each other. On the entangled two-ion system, a CHSH test of LR was performed. The results from 3016 trials were recorded. The resulting estimate of the CHSH expression is $I_{\mathrm{CHSH}} = 2.414 \pm 0.058$. For the figure, we processed the data in blocks of 56 trials as before. We did not prime the ratios $R_k(x)$ for computing PBR log-p-values. In this case, there is insufficient data for PBR log-p-values to clearly exceed martingale-based ones.



III. THEORY

For SD-based and martingale-based protocols we fix a Bell inequality $I \le B$, as explained at the beginning of Sec. II. While the theory applies to multipartite Bell inequalities, we discuss it explicitly for the case of bipartite inequalities to simplify the formulas. (Our implementation of the local realism analysis engine is presently restricted to the bipartite case.) The setting and outcome combination of the $k$'th trial is denoted by $x_k = (i_k, j_k, a_k, b_k)$, where $i_k, j_k$ are the $k$'th settings and $a_k, b_k$ are the $k$'th outcomes of Alice and Bob, respectively. Let $i(x)$ and $j(x)$ be Alice's and Bob's settings, respectively, for the combination $x$. The distribution of measurement settings is fixed. The probability of settings $i, j$ is given by $p_{i,j}$.



A. SD-based Protocol

The obvious method for estimating $I$ is to compute the average of the sequential values $I(x_k)$ given by $\hat{I} = \frac{1}{n}\sum_{k=1}^{n} I(x_k)$. However, this is not the minimum-variance estimate of $I$, since the setting distribution is fixed and known. In fact, the conventional way of writing a Bell inequality is as a sum of expectations as in Eq. (1), which makes it independent of the probability distribution of the settings. The correspondence between the two ways of writing a Bell inequality is given by

$\langle I(x) \rangle = \sum_{i,j} p_{i,j} \langle I(x) \,|\, i(x) = i, j(x) = j \rangle$, (11)

where the expectation in the sum is conditioned on the settings indicated. If we assume that the state in each trial is identical and do not worry about the memory and locality loopholes, we can estimate each expectation $\langle I(x) \,|\, i(x) = i, j(x) = j \rangle$ separately, experimentally fixing the settings for each estimate, if desired. The right-hand side of Eq. (11) can then be computed formally. If we define $n(i, j, a, b)$ to be the number of trials with settings $i, j$ and outcomes $a, b$, the estimate for $I$ thus computed is

$\tilde{I} = \sum_{i,j} p_{i,j}\, \frac{\sum_{a,b} I(i, j, a, b)\, n(i, j, a, b)}{\sum_{a',b'} n(i, j, a', b')}$, (12)

a nonlinear function of $n(i, j, a, b)$. Its SD can be approximated by linear propagation of errors from SDs for the counts $n(i, j, a, b)$, assuming each of these counts follows a Poisson distribution. The SD thus obtained is generally smaller than that of $\hat{I}$. Hence, the conventional way of estimating $I$ and the experimental SD worsens the validity problem for SD-based p-values. However, using the estimate $\hat{I}$ and the associated larger SD in the figures of Sec. II does not significantly alter the plots or their interpretation.

To convert the number of SDs to a p-value, we make the unwarranted assumption that, for any LR model, the distribution of the estimate $\hat{I}_{\mathrm{LR}}$ of $I$ is sufficiently close to Gaussian with the SD $\sigma$ calculated according to the previous paragraph but with a mean bounded by $B$. With this assumption, according to any LR model, the probability of the event $\hat{I}_{\mathrm{LR}} \ge \hat{I}$ is then bounded above by $Q((\hat{I} - B)/\sigma)$, which allows us to assign the p-value given in Eq. (4), with the caveat that our assumption is false. The comparisons between SD-based and asymptotically optimal confidence-gain rates show that this strategy for obtaining p-values is invalid. While it may be possible to obtain a valid p-value by checking the relevant averages and variances for all LR models, this is a challenging task, and one would still have to consider deviations from Gaussianity in the extreme tails.



B. Martingale-based Protocol 

For fundamental tests of quantum mechanics, a serious deficiency of SD-based assessments of experimental tests of LR is that they do not account for memory effects [8], including the possibility that the state and settings drift in the course of the experiment. To take such effects into account, R. Gill [6] considered the time-sequence $M_k = \sum_{l=1}^{k} (I(x_l) - B)$ as a super-martingale and applied large-deviation bounds. Here, the measurement settings are assumed to be chosen randomly and independently by Alice and Bob according to the fixed probability distribution $p_{i,j}$ built into the inequality of Eq. (2). Let $W_k$ be all the information available before the $k$'th trial. According to any LR model, the conditional expectation of $M_k$ given $W_k$ satisfies

$\langle M_k | W_k \rangle = \langle I(x_k) - B + M_{k-1} | W_k \rangle$
$= \langle I(x_k) | W_k \rangle - B + \langle M_{k-1} | W_k \rangle$
$= \langle I(x_k) | W_k \rangle - B + M_{k-1}$
$\le M_{k-1}$. (13)

The last inequality follows from the fact that the Bell inequality (2) is satisfied for any LR model, regardless of prior information. The inequality in Eq. (13) is the defining property for a super-martingale $\{M_k : k = 1, 2, \ldots\}$. This inequality is still satisfied if $I(x)$ and $B$ have been normalized and shifted by some constants so that $-4 \le I(x) \le 4$. With this normalization and shift, each "increment" $M_k - M_{k-1}$ of the super-martingale is bounded between $b_l = -4 - B$ and $b_u = 4 - B$. By applying the
Azuma-Hoeffding inequality [21-23], we find that, after
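$n$ trials, the probability that an LR model yields an estimate $\hat{I}_{\mathrm{LR}}$ greater than or equal to the observed $\hat{I}$ is bounded above by

$\mathrm{Prob}_{\mathrm{LR}}(\hat{I}_{\mathrm{LR}} \ge \hat{I}) = \mathrm{Prob}_{\mathrm{LR}}(M_n \ge n(\hat{I} - B)) \le \exp\left(-\frac{2n(\hat{I} - B)^2}{(b_u - b_l)^2}\right)$. (14)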
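
The bound in Eq. (14) is straightforward to evaluate numerically. The following Python sketch implements the affine rescaling of a Bell function into the interval from -4 to 4 and the resulting tail bound. The function names are hypothetical, and returning 1 when no violation is observed is a convention adopted here for the sketch rather than a statement taken from the text.

```python
import math

def rescale_bell_function(values, B, lo=-4.0, hi=4.0):
    """Affinely map the values of I(x) and the bound B so that I(x) lies in [lo, hi].

    `values` lists I(x) over all setting and outcome combinations x.
    Returns the rescaled values and the rescaled bound.
    """
    b_l, b_u = min(values), max(values)
    scale = (hi - lo) / (b_u - b_l)
    return [scale * (v - b_l) + lo for v in values], scale * (B - b_l) + lo

def martingale_p_value(I_hat, n, B, b_l=-4.0, b_u=4.0):
    """Azuma-Hoeffding bound exp(-2 n (I_hat - B)^2 / (b_u - b_l)^2)."""
    if I_hat <= B:
        return 1.0  # convention assumed here: no violation, trivial p-value
    return math.exp(-2.0 * n * (I_hat - B) ** 2 / (b_u - b_l) ** 2)

# Hypothetical inputs: 5000 trials with an estimate of 2.8 against the bound 2.
p_mart = martingale_p_value(2.8, 5000, 2.0)

# Hypothetical Bell function taking values 0 and 1 with LR bound 0.75.
scaled_values, scaled_B = rescale_bell_function([0.0, 1.0], 0.75)
```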
This implies a valid p-value of

$p^{(\mathrm{mart})} = \exp\left(-\frac{2n(\hat{I} - B)^2}{(b_u - b_l)^2}\right)$. (15)

Substituting $b_u - b_l = 8$ gives Eq. (6). Note that, for the CHSH inequality, the expression for martingale-based p-values obtained above improves on the expression in Ref. [5] and the expression applied to experimental data in Ref. [11] by taking advantage of the bounds on $I(x)$ in the formulation of the Azuma-Hoeffding inequality used here.

We cannot expect the bound on the tail probability in Eq. (14) to be asymptotically optimal, since the only constraints considered are the bounds of $I(x)$. The PBR protocol takes advantage of all available constraints on the setting and outcome distributions, implicitly including all relevant Bell inequalities.



C. PBR Protocol

Let $R_k(x)$, $k = 0, 1, \ldots$ be a sequence of PBRs as introduced in Sec. II C. They are designed so that $0 \le R_k(x)$ and $\langle R_k(x) \rangle \le 1$ for any LR model, provided that the setting distribution is $p_{i,j}$. Here, $R_k$ may depend on $x_1, \ldots, x_k$ and other aspects of the experiment before starting the $(k+1)$'th trial. We now show that for any sequence of $R_k$ with these properties, the p-value computed according to Eq. (9) is valid.

As in the previous section, we let $W_k$ denote all the information available before the $k$'th trial. Let $P_k = \prod_{l=1}^{k} R_{l-1}(x_l)$. According to any LR model with arbitrary memory, the expectation of $P_k$ conditioned on $W_k$ satisfies

$\langle P_k | W_k \rangle = \left\langle \prod_{l=1}^{k} R_{l-1}(x_l) \,\Big|\, W_k \right\rangle$
$= \left\langle \prod_{l=1}^{k-1} R_{l-1}(x_l) \times R_{k-1}(x_k) \,\Big|\, W_k \right\rangle$
$= \prod_{l=1}^{k-1} R_{l-1}(x_l) \times \langle R_{k-1}(x_k) | W_k \rangle$
$\le P_{k-1}$, (16)

where we used the facts that $W_k$ includes $R_{l-1}$ and $x_{l-1}$ for $l \le k$, and that the LR bound on $\langle R_{k-1}(x) \rangle$ is 1 given $W_k$, as the LR model in the bound is arbitrary. We can compute the expectations of both sides of Eq. (16) to show that, according to any LR model, $\langle P_k \rangle \le \langle P_{k-1} \rangle$, and therefore, by induction, $\langle P_k \rangle \le 1$.

Given a sequence of experimental results $x_1, \ldots, x_n$ from $n$ trials, the random variable $P_n$ takes a specific value $P$. Suppose that $P_n$ is constrained by LR, possibly with memory. By construction, $P_n \ge 0$ and the expectation of $P_n$ according to this model is bounded above by 1. According to Markov's inequality, we conclude that

$\mathrm{Prob}_{\mathrm{LR}}(P_n \ge P) \le \min(1/P, 1)$, (17)

which shows that we can assign a valid p-value for rejecting LR by setting $p^{(\mathrm{PBR})} = \min(1/P, 1)$ as in Eq. (9). Note that Eq. (16) shows that the sequence $P_k$, $k = 1, 2, \ldots$ is a super-martingale under any LR model. However, this super-martingale's "increment" is not bounded, so we cannot directly apply the method of Sec. III B to bound the tail probability.

For the extremely low p-values of interest in tests
of LR, we are looking for large log-p-value increments
$\log_2(R_n(x_{n+1}))$ at the $(n+1)$'th trial. Therefore, before
the $(n+1)$'th trial, our goal is to choose $R_n(x)$ so
as to maximize the experimentally expected increment
$\langle \log_2(R_n(x_{n+1})) \rangle$. For this purpose, we can take advantage
of anything we know about the probability distribution
of the result $x_{n+1}$ to be obtained at the next
trial. Consider a probability distribution $q$ for $x_{n+1}$,
which may be either the true distribution or an estimate
thereof. Let $p$ be the distribution according to an
LR model. Note that, because the setting distribution
is under experimental control, the probability distributions
$q$ and $p$ must be consistent with the chosen setting
distribution. Our ability to distinguish the probability
distributions $q$ and $p$ given a collection of independent
samples from $q$ can be characterized by the asymptotically
optimal confidence-gain rate for rejecting $p$ in favor
of $q$. As shown in Ref. [24], this optimal rate is given by
the Kullback-Leibler (KL) divergence from $q$ to $p$,



D_{\mathrm{KL}}(q|p) = \sum_x q_x \log_2(q_x/p_x).    (18)



The KL divergence is nonnegative, and it is zero iff $p = q$.
This motivates seeking an LR model whose probability
distribution $p_{\mathrm{LR}}$ minimizes the KL divergence from $q$ [7].
This is the optimal LR model mentioned in Sec. II C. We
define $S_q = D_{\mathrm{KL}}(q|p_{\mathrm{LR}})$, and refer to $S_q$ as the statistical
strength for rejecting LR by means of a test with the
distribution $q$.
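
As a small illustration of Eq. (18), the following Python sketch evaluates the KL divergence in bits for two distributions given as equal-length lists. The function name `kl_divergence_bits` is hypothetical, and the conventions $0 \log_2 0 = 0$ and $D_{\mathrm{KL}} = \infty$ when $p_x = 0 < q_x$ are the usual ones, assumed here rather than stated in the text.

```python
import math


def kl_divergence_bits(q, p):
    """Return D_KL(q|p) = sum_x q_x log2(q_x / p_x) for lists q and p."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue  # convention: 0 * log2(0 / p_x) = 0
        if px == 0:
            return math.inf  # q puts weight where p has none
        total += qx * math.log2(qx / px)
    return total
```

With $p$ set to the optimal LR distribution $p_{\mathrm{LR}}$ for a given $q$, the returned value is the statistical strength $S_q$.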

We claim that if we define $R_n(x) = q_x / p_{\mathrm{LR},x}$, then
$0 \le R_n(x)$, and for any LR model, the expectation satisfies
$\langle R_n(x) \rangle \le 1$. Consequently, the p-value computed
according to Eq. (9) is valid. To prove the claim, consider
$\phi(\beta) = D_{\mathrm{KL}}(q \,|\, p_{\mathrm{LR}} + \beta (p - p_{\mathrm{LR}}))$, where $0 \le \beta \le 1$.
For any $p$ in the convex set of LR distributions, by optimality
of $p_{\mathrm{LR}}$, $\phi(\beta) \ge \phi(0)$. It follows that
$\frac{d\phi}{d\beta}\big|_{\beta = 0^+} \ge 0$. Consequently,

\sum_x q_x \frac{p_{\mathrm{LR},x} - p_x}{p_{\mathrm{LR},x}} \ge 0,    (19)

which can be rearranged to show that

\langle R_n(x) \rangle_p = \sum_x p_x \frac{q_x}{p_{\mathrm{LR},x}} \le 1.    (20)



The claim follows. Bell inequalities of the form shown in
Eq. (20), which are based on minimizing the KL divergence,
were introduced in Ref. [25].

Consider the choice $R_n(x) = q_x / p_{\mathrm{LR},x}$ made before the
$(n+1)$'th trial. If $q$ is the true distribution of $x_{n+1}$, then
the experimental expectation $I = \langle \log_2(R_n(x_{n+1})) \rangle$ is the
statistical strength $S_q$. Since $I$ is the expected log-p-value
increment, which cannot exceed $S_q$ [24], this choice of $R_n$
maximizes the confidence-gain rate. However, we do not
know the true distribution $q$. Instead, we obtain good
estimates $q'$ of $q$ before the $(n+1)$'th trial, and determine
the corresponding optimal LR model's probability
distribution $p'_{\mathrm{LR}}$. We then set $R_n(x) = q'_x / p'_{\mathrm{LR},x}$ to
compute and update the PBR p-value. If the experiment is
sufficiently stable, good estimates can be obtained from
the frequencies of events observed in trials so far. The
estimates can be improved by taking into account that
the setting distribution is known and the distributions of
marginal outcomes for given settings of Alice or Bob must
agree due to the no-signaling constraints. We discuss how
to do this in App. 2. In App. 3, we show that if the trials
are independent and identically distributed, then PBR
p-values computed with any converging method for estimating
the true setting and outcome distribution $q$ have
the property that the confidence-gain rate approaches the
statistical strength $S_q$, thus proving asymptotic optimality
of PBR p-values.

To determine the optimal LR model one can use nu- 
merical algorithms for optimizing convex functions on a 
convex domain. In this case one can use the expectation- 
maximization (EM) algorithm [53] as discussed in [2"T] . 
A problem is that due to stopping criteria and numerical 
precision, one cannot expect to find the exact optimum. 
We show in App. 3 that one can compensate for this
problem to maintain validity of the computed p-value. 
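
A minimal sketch of such a minimization is given below for the simplest configuration: two parties, two settings and two outcomes each, and a uniform setting distribution. All names (`deterministic_lr`, `em_optimal_lr`, `chsh_quantum`) are hypothetical, the multiplicative update is the textbook EM step for mixture weights and is assumed rather than confirmed to be the variant used by the authors, and the example distribution $q$ (correlations $\pm 1/\sqrt{2}$, the maximal quantum CHSH violation) is chosen only for illustration.

```python
import itertools
import math

# All 16 setting-and-outcome combinations x = (i, j, a, b).
X = [(i, j, a, b) for i in (0, 1) for j in (0, 1) for a in (0, 1) for b in (0, 1)]


def deterministic_lr(setting_prob=0.25):
    """Return the 16 deterministic LR distributions p_lambda over X."""
    models = []
    for a0, a1, b0, b1 in itertools.product((0, 1), repeat=4):
        lam_a, lam_b = (a0, a1), (b0, b1)
        models.append([setting_prob if (a == lam_a[i] and b == lam_b[j]) else 0.0
                       for (i, j, a, b) in X])
    return models


def kl_bits(q, p):
    """D_KL(q|p) in bits; assumes p_x > 0 wherever q_x > 0."""
    return sum(qx * math.log2(qx / px) for qx, px in zip(q, p) if qx > 0)


def mixture(weights, models):
    return [sum(w * m[k] for w, m in zip(weights, models)) for k in range(len(X))]


def em_optimal_lr(q, models, iterations=500):
    """Approximate the LR mixture closest to q in KL divergence.

    Each EM step multiplies w_lambda by sum_x q_x p_{lambda,x} / p_x(w); this
    keeps the weights normalized and does not increase D_KL(q | p(w)).
    """
    weights = [1.0 / len(models)] * len(models)
    for _ in range(iterations):
        p = mixture(weights, models)
        weights = [w * sum(q[k] * m[k] / p[k] for k in range(len(X)) if q[k] > 0)
                   for w, m in zip(weights, models)]
    return weights, mixture(weights, models)


def chsh_quantum(setting_prob=0.25):
    """Illustrative q with correlations +1/sqrt(2), except -1/sqrt(2) for settings (1, 1)."""
    e = 1 / math.sqrt(2)
    q = []
    for (i, j, a, b) in X:
        corr = -e if (i, j) == (1, 1) else e
        sign = 1 if a == b else -1
        q.append(setting_prob * (1 + sign * corr) / 4)
    return q


q = chsh_quantum()
w, p_lr = em_optimal_lr(q, deterministic_lr())
strength = kl_bits(q, p_lr)  # statistical strength S_q in bits per trial
```

In this example the weights concentrate on the eight deterministic models that saturate the CHSH inequality, and `strength` comes out at roughly 0.046 bits per trial. For general inputs the fixed iteration count is only a placeholder for a proper stopping criterion.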



IV. CONCLUSION 

The degree of violation of LR in a Bell-type test is usu- 
ally expressed in terms of the number of SDs of violation. 
This quantity cannot, however, be used to obtain valid 
p-values for rejecting LR by conventional means. It also 
fails to quantitatively compare the success of different ex- 
perimental tests of LR and does not account for stability 
issues or memory effects in experiments. We solve these 
problems by providing a method — the PBR protocol — for 
determining valid p-values directly from the settings and 
outcomes in a sequence of trials. The PBR protocol does 
not rely on a predetermined Bell inequality, adapts to 
the actual experimental configuration, and is asymptoti- 
cally optimal for independent and identically distributed 
trials. It therefore provides a standardized measure of 
success for experimental tests of LR. While the protocol 
remains valid if the experiment drifts over the sequence 
of trials, how well it performs depends on the nature of 
the drifts and how the protocol takes them into account. 
Another valid protocol that accounts for memory effects 



can be based on martingale bounds [5[ [B] . This proto- 
col requires a Bell inequality that is fixed for the exper- 
iment. Given the Bell inequality, the martingale-based 
protocol has the advantage that it is computationally ef- 
ficient with respect to number of settings, outcomes, and 
parties. The disadvantage is that it is suboptimal and 
does not provide a clear quantitative comparison of dif- 
ferent experimental tests. Our simulations show that it 
is practical to apply the PBR protocol to data from typ- 
ical experimental configurations, and that the running p- 
values can be used for tweaking an experiment in progress 
to find the experimentally accessible configuration that 
provides the highest violation of LR. 
Appendix
1. Statistical Concepts 

A main purpose of the PBR and related protocols is to 
evaluate the strength of the evidence against LR by com- 
puting valid p-values given the data. Some care must be 
taken in interpreting such p-values in terms of probabili- 
ties. For example, the p-value cannot be interpreted as a 
probability that LR is true. Although they are computed 
for the data, their validity is defined in terms of what is 
known before the experiment, not after. Strictly speak- 
ing, we can only state for sure that before performing the 
trials, the following holds: For any fixed $0 < \alpha < 1$, if
LR holds, then the probability that the returned p-value
satisfies $p \le \alpha$ is at most $\alpha$. Although we have no intention
of making an actual decision on the failure of LR,
this statement can be viewed in terms of traditional hypothesis
testing: The protocol tests LR simultaneously
at all significance levels $\alpha$, and "rejects" LR at a given
$\alpha$ if $p \le \alpha$. The validity property is equivalent to the
statement that, if LR holds, the maximum probability of
(falsely) rejecting at level $\alpha$ is bounded above by $\alpha$. This
justifies the use of p-values to quantify LR violation. The 
definitions of significance levels and p-values are based on 
Ref. [4], 2nd edition, pages 126 and 127. 

The p-values returned by the protocols considered here 
are defined in terms of bounds on the one-sided tail probabilities
of a test statistic $T$. For given $T$, it is conventional
to define the p-value of $T$ given data $x$ as the supremum
of the tail probabilities $\mathrm{Prob}(T \ge T(x))$ over all hypotheses
to be rejected (the null hypotheses). While such
tight p-values are desirable, they are impractical to compute
in general. Hence our definition of valid p-values requires
only an upper bound. Note that, for our situation,
the computation of tight p-values is further complicated
by the fact that the set of null hypotheses includes all 
possible sequences of LR models depending on previous 
trials. Furthermore, while the statistic is well-defined for 
any realization of the PBR protocol, it is not unique. 

We use the term "protocol" rather than "test" for
two reasons. The first is that the term "test" in "test of 
LR" typically refers to the experimental setup and subse- 
quent analysis, not a conventional hypothesis test. The 
second is that hypothesis tests, as the term is used in 
mathematical statistics, are valid by definition. Thus, 
although we do not encourage it, one can think of a valid 
analysis protocol as a family of hypothesis tests. For 
such a family to be useful, the tests should also have high 
power. For our situation, one can express the power in 
terms of the probabilities of rejection at given significance 
levels and non-LR models. Alternatively, one can con- 
sider the expected p-values, and look for tests for which 
they are as small as possible. We do not expect that the
PBR protocol has particularly low p-values for a given finite
number of trials. In fact, because of the conservative
nature of the Markov bounds, better tests exist. However,
asymptotic optimality of the PBR protocol assures
us that it performs well when the evidence for rejection
is very strong. It is also worth noting that many issues
that arise in applications of hypothesis testing, such as
selection biases, are less of a concern when one is considering
the extremely low p-values that are desirable when
falsifying a physical theory. Corrections for such effects
reduce log-p-values by relatively small terms in our setting.
Also, one application of the PBR protocol is to
quantify the success of an experiment independent of the
details of the configuration, so that different experiments
can be compared. For this application, the statistical
interpretation of the p-value serves only as motivation.

Probability ratios such as the ones we use to compute
the values of $R_k(x)$ in Eq. (8) are often referred to as
likelihood ratios. Likelihood ratios play an important
role in many statistical tests as explained in statistics
textbooks such as Ref. [4]. In the PBR protocol, the
statistic can be computed from any sequence of nonnegative
functions $R_k(x)$ satisfying the inequality in Eq. (8).
Thus, the probability ratios are simply an intermediate
step to obtaining such functions. We do not ascribe any
other meaning to the ratios.



2. Estimating the Setting and Outcome Distribution



Consider $n$ trials with settings and outcomes given by
$x_1, \ldots, x_n$. Our goal is to obtain an estimate $q'$ of the
true probability distribution $q$ of the $(n+1)$'th trial's
settings and outcomes. Assuming no other knowledge,
the estimate can be based on the empirical frequencies
$f_x = \frac{1}{n} \sum_{k=1}^{n} \delta_{x_k,x}$. Due to statistical fluctuations, the
empirical frequencies are not likely to satisfy the following
known constraints satisfied by $q$:

• Setting distribution: The setting distribution $p_{i,j}$
is fixed, and $q$ satisfies $\sum_{a,b} q_{(i,j,a,b)} = p_{i,j}$.

• No-signaling: Given that Alice uses setting $i$, the
distribution of Alice's measurement outcomes does
not depend on Bob's settings, and vice versa.

There are two other issues for calculating PBR p-values.
The first is that some empirical frequencies $f_x$ may be
zero. If our estimate is $q' = f$, zero frequencies can be
disastrous. In the case where the corresponding settings
and outcomes occur in the next trial, the ratio contributing
to the PBR p-value in Eq. (9) can be zero, and then
the p-value goes to 1 with no possibility of later recovery.
The second and related issue is that in the absence
of prior knowledge, initially we have insufficient information
to make useful estimates of probability distributions
of future settings and outcomes. Even if the problem
of zero frequencies has been taken care of, this can still
result in initial "learning" transients that result in a negative
offset in the accumulated log-p-values.

Our approach for estimating the next trial's setting 
and outcome distribution uses maximum likelihood to 
obtain an estimate that respects the above constraints 
and then adjusts the estimate by mixing in a distribution 
that is uniform conditional on the settings. To reduce 
the impact of learning transients, we process the trials in 
blocks. 

To apply maximum likelihood for computing a first estimate
$q_0$ of $q$, we assume independent and identically
distributed trials. The probability of observing empirical
frequencies $f$ after $n$ trials given that the true distribution
is $q$ is proportional to

L(f|q) = \prod_x q_x^{n f_x}.    (A.1)

We therefore set $q_0$ according to

q_0 = \operatorname{argmax}_{q' \in V} L(f|q'),    (A.2)



where $V$ is the set of probability distributions satisfying
the setting distribution and no-signaling constraints.
These constraints are linear and $\log(L(f|q))$ is concave,
so there is no difficulty in applying available nonlinear
optimization tools. Note that, for the purpose of calculating
PBR p-values, it is not critical that Eq. (A.2) is
exactly satisfied, so it is not necessary to use extremely
tight stopping criteria to ensure identity with the best
numerical precision possible. Also, whereas the assumptions
underlying Eq. (20) require that the setting distribution
constraint is satisfied, the no-signaling constraint
is not critical. Applying it helps improve our estimates,
but the effect on the log-p-value increments becomes negligible
for large $n$.

There are different ways to solve the problem with empirical
frequencies that are zero; some are explained in
Refs. [351 HH]. They generally involve mixing in a distribution
that has no zero probabilities with a weight that
decreases to zero as $n$ grows. For the plots in Figs. 3
and 4 we modified $q_0$ by setting $q_1 = \frac{n}{n+1} q_0 + \frac{1}{n+1} u$,
where conditionally on the settings, $u$ is uniform, and $u$'s
setting distribution is $p_{i,j}$.
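
The mixing step can be sketched as follows. This is a simplified illustration with hypothetical function names: it starts from the raw empirical frequencies rather than the constrained maximum-likelihood estimate $q_0$, and the mixing weight $1/(n+1)$ is an assumption, picked as one simple weight that decreases to zero as $n$ grows.

```python
from collections import Counter


def empirical_frequencies(results, all_results):
    """Return f_x = (number of trials with result x) / n for every x in all_results."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one trial")
    counts = Counter(results)
    return {x: counts[x] / n for x in all_results}


def conditionally_uniform(all_results, setting_prob, outcomes_per_setting):
    """Return u: uniform over outcomes given the settings, with setting distribution p_{i,j}.

    Each result is a tuple x = (i, j, a, b); setting_prob maps (i, j) to p_{i,j}.
    """
    return {x: setting_prob[(x[0], x[1])] / outcomes_per_setting for x in all_results}


def mix_with_uniform(q0, u, n):
    """Return q1 = (1 - w) q0 + w u with the assumed weight w = 1 / (n + 1)."""
    w = 1.0 / (n + 1)
    return {x: (1.0 - w) * q0[x] + w * u[x] for x in q0}


# Example: two settings and two outcomes per party, uniform settings, four trials.
X = [(i, j, a, b) for i in (0, 1) for j in (0, 1) for a in (0, 1) for b in (0, 1)]
setting_prob = {(i, j): 0.25 for i in (0, 1) for j in (0, 1)}
results = [(0, 0, 0, 0)] * 3 + [(0, 0, 1, 1)]
f = empirical_frequencies(results, X)
u = conditionally_uniform(X, setting_prob, outcomes_per_setting=4)
q1 = mix_with_uniform(f, u, n=len(results))
```

After mixing, every entry of `q1` is strictly positive, so a result that has not yet been observed no longer forces a ratio of zero.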

There are different approaches to mitigating the effect 
of the initial learning transient. The first is to "prime" 
the estimates with knowledge about the experiment avail- 
able before the trials are started. Such knowledge could 
be based on theory or on experiments designed to char- 
acterize the quantum state and measurement setup. The 
prior information must be assigned a weight. In our 
implementation of the local realism analysis engine, the 
weight is determined by the number of trials that would 
have been required to obtain an equally good estimate 
directly from the frequencies. Proper use of priming re- 
quires that the initial estimates and parameters such as 
the weight are determined "blindly" before any knowl- 
edge of the actual data to be analyzed is available. 

A second approach is to set $R_n(x) = 1$ unless the statistical
strength $S$ for $q_1$'s violation of LR seems sufficiently
significant given that the estimated distribution
$q_1$ is based on $n$ trials. While one might expect that
the violation is sufficiently significant if $nS > c$ for some
constant $c$, simulations show that the best choice of $c$
depends on the distribution of settings and outcomes in
the experiment.

The third and simplest approach is to block the data
from the trials. Instead of updating the log-p-value after
every trial, we process data $h$ trials at a time. The
first block is used only for estimating the setting and
outcome distribution of future trials. That is, we set
$R_k(x) = 1$ for $k = 0, \ldots, (h-1)$. Subsequently, we have
$R_{mh+k} = R_{mh}$ for $k = 1, \ldots, (h-1)$ and all $m$. Note
that neither the validity nor the asymptotic optimality
of the calculated p-values requires updating the PBRs
after each trial. Choosing $h$ large enough ensures that
the first block's trials have sufficient information for obtaining
reasonable estimates of the distribution. An additional
advantage of blocking the trials is that we avoid
unnecessarily invoking the computationally costly optimizations
required for updating the PBRs. We standardized
the choice of block size so that if the total number of
trials to be analyzed is $N$, $h$ is the maximum of $\lceil N/1000 \rceil$
and $\lceil \ln(2d) d \rceil$, where $d$ is the number of possible setting
and outcome combinations in a trial. The first expression
ensures that we do not lose too much log-p-value by using
the first block only for learning the setting and outcome
distribution. The second one is chosen so that if $q$ is
uniform, the probability that every setting and outcome
combination occurs is at least 1/2.
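
The block-size rule and the reasoning behind its second term can be checked with a short Python sketch. The function names are hypothetical, and reading the brackets in the rule as ceilings is an assumption. The helper `miss_probability_bound` evaluates the union bound $d (1 - 1/d)^h$ on the probability that some combination is missing after $h$ uniformly distributed trials; this is at most $1/2$ once $h \ge d \ln(2d)$.

```python
import math


def block_size(total_trials, num_results):
    """Return h = max(ceil(N / 1000), ceil(d ln(2d))) for N trials and d possible results."""
    return max(math.ceil(total_trials / 1000),
               math.ceil(num_results * math.log(2 * num_results)))


def miss_probability_bound(h, d):
    """Union bound on the probability that at least one of d equally likely
    results does not occur in h independent trials."""
    return d * (1.0 - 1.0 / d) ** h


def ratio_index(k, h):
    """Index of the PBR reused at trial index k: R_{mh+k'} = R_{mh} within a block."""
    return (k // h) * h
```

For $d = 16$ and $N = 10^4$ this gives $h = 56$, for which the bound on missing a combination is about 0.43. For $N = 10^6$ the first term dominates and $h = 1000$.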

We conclude this section with a note on implement- 
ing the PBR protocol. For monitoring an experiment 
and to adapt to changes in experimental configuration, 
the estimated setting and outcome distributions used in 
the PBRs should be based on recent trials only. This 
can be accomplished by windowing the trials with a win- 



dow large enough to have statistically significant viola- 
tion of LR (if there is violation), but small enough to 
avoid seeing significant changes in configuration. Our 
implementation of the local realism analysis engine uses 
a computationally simpler approach based on weighting 
the trials with exponentially decreasing weights in time 
determined by a configurable half-life. This feature was 
not used in the comparisons in Sec. II.
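
One way to realize such exponentially decreasing weights is sketched below. The details are assumptions, since the text only states that a configurable half-life is used: here the half-life is measured in trials, each older trial is down-weighted by a constant factor per step, and the function name is hypothetical.

```python
def exponentially_weighted_frequencies(results, all_results, half_life):
    """Return weighted frequencies in which a trial's weight halves every
    `half_life` trials (assumed convention). `results` is ordered oldest first."""
    if not results:
        raise ValueError("need at least one trial")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    decay = 0.5 ** (1.0 / half_life)
    weights = {x: 0.0 for x in all_results}
    total = 0.0
    for x in results:
        # Age all previous trials by one step, then add the new trial with weight 1.
        for key in weights:
            weights[key] *= decay
        total = total * decay + 1.0
        weights[x] += 1.0
    return {x: wx / total for x, wx in weights.items()}
```

With a half-life of one trial and the two results `"a"` then `"b"`, the weighted frequencies are 1/3 and 2/3, so the more recent result dominates.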



3. Effects of Suboptimal Estimates and LR Models 

Ideally the estimated distribution $q'$ used in the numerator
of $R_n$ matches the true distribution $q$, and the LR
distribution $p'_{\mathrm{LR}}$ in the denominator of $R_n$ exactly minimizes
the KL divergence from $q'$. As shown in Sec. III C,
having $q'$ different from $q$ does not affect the validity of
the PBR p-values. But it can reduce the expected log-p-value
increment $I$. Let $S_q$ be the statistical strength of $q$
for LR violation. We show that

S_q \ge I \ge S_q - D_{\mathrm{KL}}(q|q').    (A.3)

For reasonable methods of estimating $q'$ such as the one
described in App. 2 and independent and identically distributed
trials, $q'$ almost surely approaches $q$ so that
$D_{\mathrm{KL}}(q|q')$ goes to zero. This shows that the PBR protocol
has asymptotic confidence-gain rate $S_q$.



To prove the first inequality in Eq. (A.3), let $p_{\mathrm{LR}}$ be
the LR distribution that minimizes the KL divergence
from $q$, so that $S_q = D_{\mathrm{KL}}(q|p_{\mathrm{LR}})$. We bound $I$ as follows:

S_q - I = \sum_x q_x \log_2(q_x/t_x),    (A.4)

where we define $t_x = p_{\mathrm{LR},x} q'_x / p'_{\mathrm{LR},x}$. Since $q'_x / p'_{\mathrm{LR},x}$ is
the PBR, and $p_{\mathrm{LR}}$ is an LR distribution, we know that
$c = \sum_x t_x \le 1$ [Eq. (20)]. Since $t' = t/c$ is a probability
distribution, we can continue the calculation:

S_q - I = \log_2(1/c) + \sum_x q_x \log_2(q_x/t'_x) \ge 0,    (A.5)

because $\log_2(1/c) \ge 0$ and the second term is a KL divergence.



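To obtain the second inequality of Eq. (A.3), we bound

I = \sum_x q_x \log_2(q'_x / p'_{\mathrm{LR},x})
  = \sum_x q_x \log_2(q_x / p'_{\mathrm{LR},x}) - \sum_x q_x \log_2(q_x / q'_x)
  = D_{\mathrm{KL}}(q|p'_{\mathrm{LR}}) - D_{\mathrm{KL}}(q|q')
  \ge D_{\mathrm{KL}}(q|p_{\mathrm{LR}}) - D_{\mathrm{KL}}(q|q')
  = S_q - D_{\mathrm{KL}}(q|q').    (A.6)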



The denominator $p'_{\mathrm{LR}}$ of the PBRs $R_n$ must be computed
numerically. Consequently, the distribution $p'_{\epsilon,\mathrm{LR}}$
actually obtained is typically not identical to $p'_{\mathrm{LR}}$ and
may not minimize the relevant KL divergence. Hence,
there may be an LR distribution $p$ for which $\langle R'_n(x) \rangle_p =
\langle q'_x / p'_{\epsilon,\mathrm{LR},x} \rangle_p$ is greater than 1, and so the PBR p-value is
not valid if it is computed according to Eq. (9) with $R'_n$.
To maintain validity, we determine the maximum value
$1 + \epsilon$ of $\langle R'_n(x) \rangle_p$ over all LR distributions $p$ and then set
$R_n = R'_n / (1 + \epsilon)$. To determine the bound $1 + \epsilon$, we
recall that LR distributions are mixtures of distributions
$p_\lambda$ induced by "local hidden variables" $\lambda$. Each $\lambda$ assigns
deterministic outcomes independently for each setting of
Alice and each setting of Bob. We write $\lambda_{A,i}$ and $\lambda_{B,j}$
for Alice's and Bob's measurement outcomes given settings
$i$ and $j$, according to $\lambda$. The probability for the
setting and outcome combination $x = (i,j,a,b)$ is given
by $p_{\lambda,(i,j,a,b)} = p_{i,j} \delta_{a,\lambda_{A,i}} \delta_{b,\lambda_{B,j}}$. With these definitions,



1 + \epsilon = \max_{p \text{ is LR}} \langle q'_x / p'_{\epsilon,\mathrm{LR},x} \rangle_p = \max_\lambda \sum_x p_{\lambda,x} q'_x / p'_{\epsilon,\mathrm{LR},x}.    (A.7)

Because the number of different $\lambda$ is finite, the value $1 + \epsilon$
can be calculated according to Eq. (A.7). The EM algorithm
that we apply to KL-divergence minimization iteratively
updates the probability distribution over the set
of hidden variable assignments $\lambda$. To perform the updates
requires the set of values that are maximized in
Eq. (A.7), so the computation of $1 + \epsilon$ can be integrated



into the algorithm with little overhead. Furthermore, the
quantity $\epsilon$ can be used as a stopping criterion for minimization.
That is, the expected log-p-value increment $I_\epsilon$,
assuming that the result $x$ is distributed according to $q'$,
satisfies

I_\epsilon = \sum_x q'_x \log_2\left( q'_x / (p'_{\epsilon,\mathrm{LR},x} (1 + \epsilon)) \right)
  = D_{\mathrm{KL}}(q'|p'_{\epsilon,\mathrm{LR}}) - \log_2(1 + \epsilon)
  \ge D_{\mathrm{KL}}(q'|p'_{\mathrm{LR}}) - \log_2(1 + \epsilon).    (A.8)



Thus, for independent and identically distributed trials,
the asymptotic confidence-gain rate is lowered by at most
$\log_2(1 + \epsilon)$.
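
The correction of Eq. (A.7) can be illustrated with a short Python sketch for two parties with two settings and two outcomes each. The names are hypothetical, the setting distribution is assumed uniform, and the deliberately poor denominator (the uniform distribution, which is itself LR) stands in for an imperfectly converged minimization. The sketch assumes the approximate LR distribution is strictly positive wherever the estimate $q'$ is.

```python
import itertools
import math

# All 16 setting-and-outcome combinations x = (i, j, a, b).
X = [(i, j, a, b) for i in (0, 1) for j in (0, 1) for a in (0, 1) for b in (0, 1)]


def deterministic_lr(setting_prob=0.25):
    """Return p_lambda over X for each of the 16 hidden variable assignments lambda."""
    return [[setting_prob if (a == lam_a[i] and b == lam_b[j]) else 0.0
             for (i, j, a, b) in X]
            for lam_a in itertools.product((0, 1), repeat=2)
            for lam_b in itertools.product((0, 1), repeat=2)]


def one_plus_epsilon(q_est, p_approx, models):
    """Return max over lambda of sum_x p_{lambda,x} q'_x / p'_{eps,LR,x}, as in Eq. (A.7)."""
    return max(sum(m[k] * q_est[k] / p_approx[k] for k in range(len(X)) if m[k] > 0)
               for m in models)


def corrected_ratios(q_est, p_approx, models):
    """Return R_n(x) = R'_n(x) / (1 + eps) for every x in X."""
    scale = one_plus_epsilon(q_est, p_approx, models)
    return [qx / (px * scale) for qx, px in zip(q_est, p_approx)]


# Illustrative estimate q': correlations +-1/sqrt(2) (maximal quantum CHSH violation).
e = 1 / math.sqrt(2)
q_est = [0.25 * (1 + (1 if a == b else -1) * (-e if (i, j) == (1, 1) else e)) / 4
         for (i, j, a, b) in X]
p_approx = [1.0 / 16] * 16  # crude stand-in for the numerically obtained LR distribution
models = deterministic_lr()
scale = one_plus_epsilon(q_est, p_approx, models)
ratios = corrected_ratios(q_est, p_approx, models)
```

Here `scale` evaluates to $1 + \sqrt{2}/4 \approx 1.354$, and after division every deterministic LR model, and hence every mixture of them, gives the corrected ratios an expectation of at most 1.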